Smart Paper Notes

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Fajri Koto

Categories: Natural Language Processing | Artificial Intelligence

2022-12-19

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.


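At the center of the underlying issues that hold back research progress in Indonesian natural language processing (NLP), we find data scarcity. Resources for Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their datasets. Furthermore, the few public datasets that exist are scattered across different platforms, which makes reproducible and data-centric research in Indonesian NLP even more arduous. Rising to this challenge, we initiate the first Indonesian NLP crowdsourcing effort, NusaCrowd. NusaCrowd strives to provide the largest aggregation of datasheets, with standardized data loading for NLP tasks in all Indonesian languages. By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem that hinders NLP progress in Indonesia and bring NLP practitioners together to collaborate.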


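Tremendous progress in deep learning has led to unprecedented achievements across numerous domains. While the performance of deep neural networks is indubitable, the architectural design and interpretability of such models are nontrivial. Neural architecture search (NAS) has been introduced to automate the design of neural network architectures. Recent progress has made these methods more pragmatic by exploiting distributed computation and novel optimization algorithms. However, little work has been done on optimizing architectures for interpretability. To this end, we propose a multi-objective distributed NAS framework that optimizes for both task performance and introspection. We leverage the non-dominated sorting genetic algorithm (NSGA-II) and explainable AI (XAI) techniques to reward architectures that can be better comprehended by humans. The framework is evaluated on several image classification datasets. We demonstrate that jointly optimizing for introspection ability and task error leads to more disentangled architectures that perform within tolerable error.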
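
The multi-objective selection mentioned in this abstract can be illustrated with a minimal sketch of non-dominated sorting, the ranking step that NSGA-II is built around. This is not the authors' implementation: the two objectives (`task_error` and a hypothetical `introspection_cost`, both minimized), the candidate values, and the function names are assumptions made up for illustration. The sketch uses a simple repeated-peeling approach rather than NSGA-II's faster bookkeeping, and it omits crowding distance entirely.

```python
from typing import List, Tuple

# Hypothetical objective pair: (task_error, introspection_cost), both minimized.
Objectives = Tuple[float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points: List[Objectives]) -> List[List[int]]:
    """Group point indices into Pareto fronts; front 0 holds the non-dominated points."""
    remaining = set(range(len(points)))
    fronts: List[List[int]] = []
    while remaining:
        # A point belongs to the current front if no other remaining point dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Illustrative (invented) scores for five candidate architectures.
candidates = [(0.10, 0.80), (0.12, 0.40), (0.20, 0.30), (0.15, 0.90), (0.25, 0.35)]
fronts = non_dominated_sort(candidates)
print(fronts)
```

Candidates in the first front trade task error against the second objective without any one of them being strictly worse than another, which is the sense in which such a search can return several architectures instead of a single winner.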
